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Abstract 

The naive importance sampling estimator, based on samples from a single importance 
density, can be numerically unstable. Instead, we consider generalized importance sampling 
estimators where samples from more than one probability distribution are combined. We 
study this problem in the Markov chain Monte Carlo context, where independent samples 
are replaced with Markov chain samples. If the chains converge to their respective target 
distributions at a polynomial rate, then under two finite moment conditions, we show a central 
limit theorem holds for the generalized estimators. Further, we develop an easy to implement 
method to calculate valid asymptotic standard errors based on batch means. We also provide a 
batch means estimator for calculating asymptotically valid standard errors of Gey erf s (1994 1 
reverse logistic estimator. We illustrate the method via three examples. In particular, the 
generalized importance sampling estimator is used for Bayesian spatial modeling of binary 
data and to perform empirical Bayes variable selection where the batch means estimator 
enables standard error calculations in high-dimensional settings. 

Key words and phrases: Bayes factors, Markov chain Monte Carlo, polynomial ergodicity, 
ratios of normalizing constants, reverse logistic estimator. 


1 Introduction 

Let 7r(x) = i>{x)/m be a probability density function (pdf) on X with respect to a measure /i(-). 
Suppose / : X —)■ M is a TT integrable function and we want to estimate := f{x)7r{x)p,{dx). 
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Let TTi{x) = yi{x)/mi be another pdf on X sueh that {x : 7ri(a:) = 0} C {a: : 7r(a:) = 0}. The 
importanee sampling (IS) estimator of E.„f based on independent and identieally distributed (iid) 
samples Xi ,from the importanee density tti is 




f{x)u{x)/m 

ui{x)/mi 


ni{x)^{dx) / / , ' , — ni{x) fi{dx) = E^f, 

(1.1) 

as n —)■ oo. This estimator ean also be used in the Markov chain Monte Carlo (MCMC) context 
when Xi,..., Xn are realizations from a suitably irreducible Markov chain with stationary den¬ 
sity TTi (Hastings ( 1970| )). Note that ( |1.1| ) requires the functions u and ui to be known. On the 
other hand, it does not depend on normalizing constants m and mi, which are generally unknown. 

In this article, we consider situations where one wants to estimate for all tt belonging 
to a large collection, say 11. This situation arises in both frequentist and Bayesian statistics. 
Although ( |1.1| ) provides consistent estimators of E^^f for all tt G 11 based on a single Markov 
chain {Xn}n>o with stationary density tti, it does not work well when vr differs greatly from tti. 
In that case the ratios u{x)/ui{x) can be arbitrarily large for some sample values making the 
estimator at ( |1.1[ ) unstable. In general, there is not a single good importance density tti which 
is close to all TT G n (see e.g. Geyer (1994)). Hence a natural modification is to replace tti in 
CD with a mixture of densities where each density in 11 is close to a subset of the k reference 
densities. To this end, denote If = where a = {ai,...,ak) are k positive 

constants, \a\ = a*, and 7ri(a;) = Ui{x)/mifori = 1,..., are/c densities known up to their 

normalizing constants. Suppose further that rii,... ,nk are positive integers and di := rrii/mi for 
i = 2,..., /c, with di = 1. Then define the {k — 1) dimensional vector 

d = (m2/mi,... ,mfc/mi). (1.2) 

Finally for / = 1,..., fc, let be an iid sample from vr^ or realizations from a positive 

Harris Markov chain with invariant density tii (for definitions see |Meyn and Tweedie] ( [199^ ). 
Then as ni —)■ oo, for all / = 1,..., fc, we have 


V = 


ni 




(0^ 


P^x) 



ni 


i,(X, 


0)i 


hXk eL. aMxf’)/d 



f{x)'X\n{x) ii(dx 
tt{x) 


Yxs=lO'sl^six)/d, 


7ll{x) ^{dx 



u^x) 


X asi^s{x)/d, 


(1.3) 


-7r;(a;) ^{dx) 


lx T^{x) 


7r(x) jji^dx) = i? 7 r/. 


The generalized IS estimator (|1.3[) has been discussed widely in the literature, e.g. applica¬ 


tions include Monte Carlo maximum likelihood estimation and Bayesian sensitivity analysis. Gill 
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et al.| ( [198^ , |Kong et al.| ( [2003] ), |Meng and Wong| ( |1996| ), |Tan| ( |2004[ ), and |Vardi| ( |1985[ ) consider 


estimation using ( |1.3| ) based on iid samples. The estimator is applieable to a mueh larger elass 
of problems if Markov ehain samples are allowed, see e.g. |Buta and Doss| ( |2011| ), [Geyer| ( |1994| ), 
and |Tan et al .1 ( 12015] ), whieh is the setting of this paper. 

Alternative importanee weights have also been proposed. In the ease when the normalizing 


eonstants m/s are known, the estimator (1.3) resembles the balance heuristic estimator of Veaeh 


and Guibas | ( |1995l ), whieh is revisited in [Owen and ZhoiT] ( |2000| ) as the deterministic mixture. 


The standard population Monte Carlo algorithm of Cappe et al. (2004) uses a weighted ratio of 
the target vr and the proposal tij it was drawn from (evaluated at the sample itself). However, 


when iid samples are available from 7ij,j = 1,2,... ,k, Elvira et al. (2015) have shown that 
the normalized estimator (m/s known) version of ( |1.3| ) always has a smaller varianee than that 
of the population Monte Carlo algorithm. Further, it may be diffieult in praetiee to find fully 
known importanee densities that approximate the target densities. Indeed, applieations sueh as in 
empirieal Bayes analysis and Bayesian sensitivity analysis routinely seleet representatives from 
the large number of target posterior densities to serve as proposal densities, and they are known 
only up to normalizing eonstants. See |Buta and Doss| ( |2011[ ), |Doss| ( |2010| ), as well as seetionj^ 
for examples. Although there is no known proof for the self normalized estimator ([Elvira et al.[ 


2015 p. 16), it is reasonable to assume the superiority of (1.3) over estimators eorresponding to 


other weighting sehemes. 


As noted in ( |1.3| ), the estimator fj eonverges to Et,/ as the sample sizes inerease to infinity, 
for iid samples as well as Markov ehain samples satisfying the usual regularity eonditions. Now 
given samples of finite size in praetiee, it is of fundamental importanee to provide some measure 
of uneertainty, sueh as the standard errors (SEs) assoeiated with this eonsistent estimator. Take 
estimators that are sample averages based on iid Monte Carlo samples for example, it is a basie 
requirement to report their SEs. But the very same issue is often overlooked in praetiee when the 
estimators have more eomplieated strueture, and when they are based on MCMC samples, largely 


due to the diffieulty of doing so. See, for e.g. Flegal et al. (2008) on the issue eoneeming MCMC 


experiments and Koehler et al. (2009) for more general simulation studies. For ealeulating SEs 
of f) based on MCMC samples, [Tan et al. ( [2015[ ) provide a solution using the method of regener¬ 
ative simulation (RS). However, this method erueially depends on the eonstruetion of a praetieal 
minorization eondition, i.e. one where suffieient regenerations are observed in finite simulations 
(for definitions and a deseription of RS see Mykland et al.[ ( T995] )). Further, the usual method of 
identifying regeneration times by splitting beeomes impraetieal for high-dimensional problems 
( [Gilks et aL||1998[ ). Henee, sueeessful applieations of RS involve signifieant trial and error and 
are usually limited to low-dimensional Gibbs samplers (see e.g. [Roy and HoberF (2007); Tan and 
































































































Robert (20091). In this paper we avoid RS and provide SE estimators of fj using the batch means 


(BM) method, which is straightforward to implement and can be routinely applied in practice. 
In obtaining this estimator, we also establish a central limit theorem (GET) for f) that generalizes 
some results in |Buta and Doss| ( |2011| ). 

The estimator fj in ( |1.3[ ) depends on the ratios of normalizing constants, d, which are unknown 
in practical applications. We consider the two-stage scheme studied in |Buta and Doss| ( |2011| ) 
where first an estimate d is obtained using Geyer’s (|l994| “reverse logistic regression” method 


based on samples from tti, and then independently, new samples are used to estimate for tt G 

(20111 showed that the asymptotic variance 


Buta and Doss 


n using the estimator T){d) in ( |1.3[ ). 
of 57 (d) depends on the asymptotic variance of d. Thus we study the GET of d and provide a 
BM estimator of the asymptotic covariance matrix of d. Since d involves multiple Markov chain 
samples, we utilize a multivariate BM estimator. Although, the form of the asymptotic covariance 
matrix of d is complicated, our consistent BM estimator is straightforward to code. 

The problem of estimating d, the ratios of normalizing constants of unnormalized densities 
is important in its own right and has many applications in frequentist and Bayesian inference. 
Eor example, when the samples are iid sequences this is the biased sampling problem studied in 


Vardi (1985). In addition, the problem arises naturally in the calculations of likelihood ratios in 


missing data (or latent variable) models, mixture densities for use in IS, and Bayes factors. 


Our work considers the problem of estimating d using Geyer’s (1994) reverse logistic re¬ 


gression method. Specifically, we study the general quasi-likelihood function proposed in Doss 


and Tan (2014). Unlike Geyer’s (1994) method, this extended quasi-likelihood function has the 


advantage of using user defined weights which are appropriate in situations where the multiple 
Markov chains have different mixing rates. We establish the GET for the resulting estimators of 
d and develop the BM estimators of their asymptotic covariance matrix. 

Thus we consider two related problems in this paper-firstly, estimating (ratios of) normalizing 
constants given samples from k densities, and secondly, estimating expectations with respect to a 
large number of (other) target distributions using these samples. In both cases, we establish GETs 
for our estimators and provide easy to calculate SEs using BM methods. 

Prior results of Buta and Doss ( |2011 ), Doss and Tan (2014), Geyer ( |1994 ), and Tan et al. 


(2015) all assume that the underlying Markov chains are geometrically ergodic. We weaken this 
condition substantially in that we only require the chains to be polynomial ergodic. To this end, 
let Ki{x, ■) be the Markov transition function for the Markov chain = {xf^} 4 >i, so that for 


any measurable set A, and s, f G {1,2,...} we have G A | xi ^ = x) = Kf{x, A). 

Eet II ■ II denote the total variation norm and 11; be the probability measure corresponding to the 
density tt;. The Markov chain <1); is polynomially ergodic of order m where m > 0 if there exists 
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W : X ^ M+ with E^, W < oo such that 


TTj ' 


There is substantial MCMC literature establishing that Markov ehains are at least polynomially 
ergodie (see |Vats et al.| ( |2015^ and the referenees therein). 

We illustrate the generalized IS method and importanee of obtaining SEs through three exam¬ 
ples. First, we eonsider a toy example to demonstrate that BM and RS estimators are eonsistent 
and investigate the benefit of allowing general weights to be used in generalized IS. Seeond, we 
eonsider a Bayesian spatial model for a root rot disease dataset where we illustrate the importanee 
of ealeulating SEs by eonsidering different designs and performing samples size ealeulations. Fi¬ 
nally, we eonsider a standard linear regression model with a large number of variables and use 
the BM estimator developed here for empirieal Bayes variable seleetion. 

The rest of the paper is organized as follows. Seotion|^is devoted to the important problem 
of estimating the ratios of normalizing eonstants of unnormalized densities, that is estimating 
d. Seetionj^eontains the eonstruetion of a GET for fj and deseribes how valid SEs of r) ean be 
obtained using BM. Seetions|^and Appendix [^eontain toy examples illustrating the benefits of 
different weight funetions. Seetion eonsiders a Bayesian spatial models for binary responses. 
The empirieal Bayes variable seleetion example is eontained in Appendix]^ We eonelude with a 
diseussion in Seetionj^ All proofs are relegated to the appendiees. 


2 Estimating ratios of normalizing constants 


Consider k densities tt; = = 1,... ,k with respeet to the measure fx, where the ui's 

are known funetions and the mi's are unknown eonstants. For eaeh I we have a positive Harris 
Markov ehain with invariant density vr^. Our objeetive is to estimate all 


possible ratios mi/mj, i ^ j or, equivalently, the veetor d defined in (|1.2[). 


Geyer’s (1994) reverse logistie regression is deseribed as follows. Eet n = '^ni and set 
ai = ni/n for now. For I = 1,... ,k define the veetor by 


0 = - log(mz) -f log(ai) 


and let 


Pi{x,C) = 


ui{x)e‘^‘ 


Eli^six)eC^ 


( 2 . 1 ) 


Given the value x belongs to the pooled sample i = 1,..., n/, I = 1,..., A;}, p;(x, C) is 

the probability that x eame from the distribution. Of eourse, we know whieh distribution the 
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sample x came from, but here we pretend that the only thing we know about x is its value and 
estimate by maximizing the log quasi-likelihood function 


ni 




( 2 . 2 ) 


1=1 i=l 


with respect to C. Since C has a one-to-one correspondence with m = (mi,..., m^), by estimat¬ 
ing ^ we can estimate m. 


As Geyer (1994) mentioned, there is a non-identifiability issue regarding for any con¬ 


stant c G M, is same as /„(<^4-c1a;) where is the vector of fc I’s. So we can estimate the true 
^ only up to an additive constant. Thus, we can estimate m only up to an overall multiplicative 
constant, that is, we can estimate only d. Let Co ^ be defined by [Co]« = [C]i ~ (Z]s=i[C]s)/fc, 


the true normalized to add to zero. Geyer (1994) proposed to estimate <^g by the maximizer 
of In subject to the linear constraint = 0, and thus obtain an estimate of d. The estima¬ 


tor d (written explicitly in Section 2.1), was introduced by Vardi (1985), and studied further by 
Gill et al. (|l988), who proved that in the iid setting, d is consistent and asymptotically normal. 


and established its efficiency. Geyer (1994) proved the consistency and asymptotic normality of 
d when are k Markov chains satisfying certain mixing conditions. In the iid setting, 

Meng and Wong (1996), Kong et al. (2003), and Tan ( |2004 ) rederived the estimate under different 
computational schemes. 

However, none of these articles discuss how to consistently estimate the covariance matrix 
of d, even in the iid setting. Recently Doss and Tan (2014) address this important issue and 


obtain a RS estimator of the covariance matrix of d in the Markov chain setting. 


Doss and Tan 


(2014) also mention the optimality results of Gill et al. ( |1988 ) do not hold in the Markov chain 
case. In particular, when using Markov chain samples, the choice of the weights aj = rij/n to 
the probability density in the denominator of ( |2.1[ ) is no more optimal and should instead 

incorporate the effective sample size of different chains as they might have quite different rates 
of mixing. They introduce the following more general log quasi-likelihood function 


ni 


^^(0 = 5^w;,^log(pz(xf\C)), 


(2.3) 


/=1 


i=l 


where the vector re G is defined by wi = ain/ni for I = 1,..., A; for an arbitrary probability 
vector a. (Note the change of notation from I to i.) Clearly if ai = rii/n, then wi = 1 and ( |2.3| ) 


becomes (2.2). 


When RS can be used, [Doss and Tan (2014) proved the consistency (to the true value Cg) 


and asymptotic normality of the constrained maximizer C, (subject to the constraint = 0) of 
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( |2.3| ) under geometric ergodicity. They also obtain a RS estimator of the asymptotic covariance 
matrix and describe an empirical method for choosing the optimal a based on minimizing the 
trace of the estimated covariance matrix of d. However, their procedure requires a practical mi- 
norization condition for each of the k Markov chains, which can be extremely difficult. Without 
a minorization condition, we show d is a consistent estimator of d, show d satisfies a CLT under 
significantly weaker mixing conditions, and provide a strongly consistent BM estimator of the 
covariance matrix of d. 


2.1 Central limit theorem and asymptotic covariance estimator 

Within each Markov chain I = 1,... ,k, assume —)■ cxd in such a way that rii/n si E (0,1). 

In order to obtain the CLT result for d, we first establish a CLT for Note that the function 
—)■ that maps Cg d is given by 

gCi-Ca^s/Qi 


9iC) = 


(2.4) 


^^ak/axj 


and its gradient at <^q (in terms of d) is 



( d2 

da 

dk 


—d2 

0 ... 

0 

D = 

0 

-d, ... 

0 


V 0 

0 ... 

—dkj 


(2.5) 


Since d = g{Co)^ ^nd by definition d = g{C)^ we can use the CLT result of C to get a CLT for d. 
First, we introduce the following notations. For r = 1,..., fc, let 


= p,(xf\ Co) - {Pr{X, Co)), ^ = 1,..., nz. 


( 2 . 6 ) 


The asymptotic covariance matrix in the CLT of C, involves two k x k matrices B and fl, which 
we now define. The matrix B is given by 


Err = ^ ajE^. {pr{X, C) [l - Pr(2f, C)]) and 
j = ^ 

k 

Ers = ajE^. {pr{X, C)Ps{X, 0) for r 7 ^ s. 


(2.7) 


i=i 
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Let f2 be the k x k matrix defined (for r,s = 1,..., fc) by 


k 9 

af " 


oo oo 

= E 7 + E E^,{Yk"Y!:‘''} + Y1 E^.iYk'M'-''’} 

1=1 i=l i=l 


( 2 . 8 ) 


Remark 1. The right hand side of ( |2.8[ ) involves terms of the form | I . 

For any fixed /, r, s and i, the two expeetations are the same if xf^ and are exchangeable, 

e.g. if the chain is reversible. In general, the two expectations are not equal. 

The matrix B will be estimated by its natural estimate B defined by 


1=1 


ni 




'(0 


ni ^ ^ 
1=1 


Brfi 


'rs / j Oil 

1=1 


ni 


(2.9) 


— 5^Pr(xf\C)p,(xf\C) ) forr 7^ s. 


(0 


ni 


i=l 


To obtain a BM estimate 12, suppose we simulate the Markov chain <I); for ni = efii iterations 
(hence ei = and bi = are functions of rii) and define for r, Z = 1 ,..., fc 

(m+l)b; 

ez - 1 . 


^ Pr{xj\c) form = 0,..., I 

^ j=mbi+l 


T 


Now set Zm = (Zm’‘\ ..., for m = 0,..., 6; — 1. For / = 1,..., Zc, denote = 


..., where = YlT=iPr(^i^ ^ C)/fo- Let 


El'l = -Xy^‘rzW_Z(0l feO-zcT for ; = , 7 . 


ez - 1 


Let 


m=0 


s = 


( 2 . 10 ) 


/sO) 


V 


0 ^ 

gw J 


( 2 . 11 ) 


and define the following k x ki^ matrix 


A^=iy-^^a^h ... 

where 4 denotes the k x k identity matrix. Finally, define 



\ I ®A:4 

nk 


( 2 . 12 ) 


12 = AffEA^. 


(2.13) 
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We are now ready to deseribe eonditions that ensure strong eonsistency and asymptotie nor¬ 
mality of d. The following theorem also provides eonsistent estimate of the asymptotic covari¬ 
ance matrix of d using BM method. Consistency of d holds under minimal assumptions, i.e. if 
$ 1 ,..., are positive Harris chains. On the other hand, CLTs and consistency of BM estimator 
of asymptotic covariance require some mixing conditions on the Markov chains. For a square 
matrix C, let denote the Moore-Penrose inverse of C. 

Theorem 1 Suppose that for each I = 1,..., fc, the Markov chain {xf ^, • • •} has invariant 
distribution tt/. 


1. If the Markov chains $ 1 , are positive Harris, the log quasi-likelihood function ( |2.3| ) 

has a unique maximizer subject to the constraint = 0. Let ^ denote this maximizer, 
and let d = g{C)- Then d d as Hi ,..., rifc —>■ 00 . 


2. If the Markov chains ^i ,..., are polynomially ergodic of order m > 1, asrii,... , 71 ^ —)■ 

cx>, y/n{d -d) A Ar(0, V) where V = 


3. Assume that the Markov chains $ 1 ,..., are polynomially ergodic of order m > 1 and 


for all I = 1,... ,k, bi = [n^J where 1 > z/ > 0. Let D be the matrix D in ( |2.5| ) 
with d in place of d, and let B and O be defined by ( |2.9| ) and ( |2.13[ ), respectively. Then, 
V := B^flBlD is a strongly consistent estimator ofV. 


3 IS with multiple Markov chains 


This section considers a CLT and SEs for the generalized IS estimator fj. From ( |1.3[ ) we see that 

f] = r}[fl(7r; a, d) = tti; a, d)/u{'K, tti; a, d), where 


V = 


k ni 


,7ri;a,d) ^ a, d) and 


u = 


1=1 1=1 

k ni 


= u{Tr,np,a,d) := ^ ^ M(xf^; a, d) 


ni 

= 1 1=1 


with 


v^'^\x-,a,d):=f{x)u{x]a,d) and u{x-,a,d): = 


i'(x) 


Y.s=i cisi^s{x)/ds 


Note that it converges almost surely to 


(3.1) 


(3.2) 


y^^alE^^u{X■,a,d) 


1=1 


'x-As=i^s^s{x)/{ms/mi) rni 


(3.3) 
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as Til,..., rzfc —)■ oo. Thus u itself is a useful quantity as it eonsistently estimates the ratios of 
normalizing eonstants {^(Tr, tti) = m/mi|7r G If}. Unlike the estimator d in Seetion]^ u does 
not require a sample from eaeh density tt G If. Thus u is well suited for situations where one 
wants to estimate the ratios u{Tr, tti) for a very large number of tt’s based on samples from a small 
number of skeleton densities, say k. Thus this method is partieularly effleient when obtaining 
samples from the target distributions is eomputationally demanding and the distributions within 
n are similar. 

In the eontext of Bayesian analysis, let tt{x) = lik(a;)p(a;)/m be the posterior density corre¬ 
sponding to the likelihood function lik(x) and prior p{x) with normalizing constant m. In this 
case, M(7r, TTi) is the so-called Bayes factor between the two models, which is commonly used in 
model selection. 

The estimators ii and v in ( |3.1| ) depend on d, which is generally unknown in practice. Here 
we consider a two-stage procedure for evaluating u. In the 1st stage, d is estimated by its reverse 
logistic regression estimator d described in Sectionusing Markov chains with 

stationary densities tti, for I = 1,... ,k. Note the change of notation from Section where we 
used Til's to denote the length of the Markov chains. In order to avoid more notations, we use 
$;’s and Ni's to denote the stage I chains and their length respectively. Once d is formed, new 
MCMC samples = 1..., A; are obtained and u{7r, TTi){E^f) is estimated using 

u{ti, TTi; a, d) a, d)) based on these 2nd stage samples. Buta and Doss (20111 propose 


this two-stage method and quantify its benefits over the method where the same MCMC samples 
are used to estimate both d and u{n,ni). 

3.1 Estimating ratios of normalizing constants 

Before we state a CUT for u{ti, tti; a, d), we require some notation. Let 


T^{7r;a,d) = Yar^^{u{X{’-,a,d)) -f 2y^Cov^,a, d), a, d)) (3.4) 

9=1 

and r^(7r; a, d) = Further, define c(7r; a, d) as a vector of length A: — 1 

with (j — l)th coordinate as 


u(7r,7ri) 


ajiyj{x) 


[c{Ti;a,d)]j_i = = 2, ...,A; 

S' Jy^^s=lO'sl^s[x)/ds 

and c(7r; a, d) as a vector of length A: — 1 with (j — l)th coordinate as 


(3.5) 


C TT 


^ ajaiv{X?)v,{xf^) . . „ , 

;a,d)L_i = > — > - i - T —for j = 2,..., A:. 


(3.6) 
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Assuming ni = efii, then let 


1 ) ^ 

ff (vr; a, d) = —^ ^ [um{a, d) - M(a, d)f , (3.7) 

where Umia, d) is the average of the (m + l)st bloek a,d), - ■ ■ , u{X^2i+i)bi '■> ^)}’ 

and u{a, d) is the overall average of a,d), - ■ ■ , u{Xn}] a, d)}. Here, bi and e/ are the 

bloek sizes and the number of bloeks respeetively. Finally let r^ (vr; a, d) = Yl\=i / -^ 1)^1 (^j 

Theorem 2 Suppose that for the stage 1 ehains, eonditions of Theorerr^holds sueh that {d— 

d) A- A/'(0, V) as N = YAi=i Suppose there exists q G [0, 00 ) sueh that n/N —)■ q 

where n = sample size for stage 2. In addition, let ni/n —)■ si for I = 

!,■■■ ,k- 


1. Assume that the stage 2 Markov ehains $ 1 ,..., <f>fc are polynomially ergodie of order m, 
and for some 5 > 0 \u{X; a, d)\ < 00 for each I = 1, • • ■ ,k where m > 1 + 2/S. 

Then as rii ,..., —)■ cx), 

y/n{u{7r, TTi; a, d) — u{tt, VTi)) A X(0, gc(7r; a, <i)''"l/c(7r; a, d) + r^(7r; a, d)). (3.8) 


2. Let V be the eonsistent estimator ofV given in Theorem\l\(3). Assume that the Markov 
chains $ 1 ,..., <f)fc are polynomially ergodie of order m > (1 + e)(l + 2/5) for some 
e, 5 > 0 sueh that Et,/u{X-, a, d)|^+^ < 00 , and for all I = 1,. .. ,k, bi = [n/J where 
1 > z/ > 0. Then qc{7i] a, dYVc{'K] a, d) + r^(7r; a, d)) is a strongly eonsistent estimator 
of the asymptotic variance in p.8[). 


Note that the asymptotie varianee in p.8[ ) has two eomponents. The seeond term is the vari- 
anee of u when d is known. The first term is the inerease in the varianee of u resulting from 
using d instead of d. Sinee we are interested in estimating u{7r, tti) for a large number of vr’s and 
for every vr the eomputational time needed to ealeulate u in p.l[ ) is linear in the total sample size 
n, this ean not be very large. If generating MCMC samples is not eomputationally demanding, 
then long ehains ean be used in the 1st stage (that is, large N/s ean be used) to obtain a preeise 
estimate of d, and thus greatly redueing the first term in the varianee expression p.8[). 


3.2 Estimation of expectations using generalized IS 


This seetion diseusses estimating SEs of the generalized IS estimator f] given in p.3| ). In order to 
state a CLT for fj we define the following notations: 

00 

= A\T^-,a,d) = Var^,(u[^'(xP;a,d)) + 2^Cov^;(r;W(xf^;a,d),'u[^'(x2^;a,d)), 

9=1 
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= 7/^(7r;a,d) = 7^ = a, d), a, d)) 

OO 

+ ^[Cov^, a, d),u{xflg-, a, d)) + Co\^^{v^^\x[lg; a, d), M(xf^; a, d))], 


ff=i 


7 ^ = 7p(7r;a,d) = Var^;(M(xP; a, d)) + 2 ^ Cov^;(M(xf^ a, d), a, d)), 

9=1 

(note 7 ^^ is the same as a, d) defined in ( |3.4[ )) and 


Ti{'K]a,d) = 


^11 ^12 

^21 ^22 


k 2 

af 


; r(7r; a, d) = V —rz(7r; a, d). 

tr 


(3.9) 


Since fj has the form of a ratio, to establish a CLT for it, we apply the Delta method on the 
function h{x, y) = x/y, with Vh{x, y) = {1/y, —x/y"^)'. Let 

p(7r; a, d) = Vh{E^fu{'K, tti), M(7 r, 7ri))T(7r; a, d)Xh{E^fu{Ti, '71 i),u{ti, tti)), (3.10) 

e(7r; a, d) is a vector of length k — 1 with (j — l)th coordinate as 


[e(vr;a,d)]j_i = 


f [fix) - Enf]i^jix) 


7i{x)dx, j = 2 ,..., fc, 


>^x Yl’l=i(^sJ^six)/ds 
and e(7r; a, d) is a vector of length k — 1 with (j — l)th coordinate as 

d?j{E’:^i<^si's{xf'^)/ds)^ [c(7r;a,d)]j_i?7[^l(7r;a,d) 

e vr;a,d = - - —^-—-—^-: 

M(7r, TTi; a, d) M(7r, tti; a, d) 

where [c(7r; a, d)]j_i is defined in ( |3.6| ). Assuming ni = eibi, let 


(3.11) 


(3.12) 


fz(7r;a,d) = —^ ^ 

P; - ^ ^ 


ez-1 


e; - 1 


m=0 



ei-l 



=,[/] 




T 


E e;- 
m=0 


Vm - 


r^rrj. 


e; — 1 \ y^ei-1 

\ Z_-/m=0 

7^i(7r;a,d) ^^^{x;a,d) 
72i(7r;a,d) 722(7r;a,d) /’ 




where Vm is the average of the (m + l)st block 

nW is the overall average of a, d), • • • , a, d)} and Um = Umi'x, a, d),u = 

M(7r,a, d) defined in Section 3.1 Finally let r(7r; a, d) = ^|7^(a^/si)r;(7r; a, d), and 

p{x-,a,d) = Vh{v^-^\d),u{d)yr{x]a,d)Vh{v^'^\d),u{d)). 
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Theorem 3 Suppose that for the stage 1 chains, conditions of Theorem^hold such that 

d) —)■ A/'(0, V) as N = Yl\=i Suppose there exists q e [0, oo) such that n/N —)■ q 

where n = sample size for stage 2. In addition, let rii/n —)■ si for I = 

!,■■■ ,k. 

1. Assume that the stage 2 Markov chains $i,..., <l>fc are polynomially ergodic of order m, 
and for some 6 > 0 a, < cx) and Jut'll (X; a, < cx), for each 

Z = 1, ■ ■ ■ ,k where m > 1 + 2/6. Then as rii,..., —)■ cx), 

y/^(^[/]( 7 ,-- _ E.„f) -A N{0, qe{'K] a, dAVe{7r; a, d) + p(7r; a, d)). (3.13) 


2. Let V be the consistent estimator ofV given in Theorem^(3). Assume that the Markov 
chains $i,..., are polynomially ergodic oforderm > (l+e)(l+2/5) for some e,6>0 
such that ET,/u{X;a,d)\'^~^^ < oo, a, d)\^~^^ < oo, and for each I = I,--- ,k, 

hi = KJ where 1 > u > 0. Then qe{n-, a, dAVe{n-, a, d) + p{'K-,a,d)) is a strongly 
consistent estimator of the asymptotic variance in (|3.13|). 


Remark 2. Part (1) of Theorems and extend |Buta and Dossf s ( |2011| ) Theorems 1 and 3, 


respeetively. Speeifieally, they require ai = ni/n whieh is a non-optimal ehoiee for a (Tan et al. 


20151. Our results also substantially weaken the Markov ehain mixing eonditions. 


Remark 3. Theorems]^ andprove eonsisteney of the BM estimators of the varianees of u and 
fj for a general a. This extends results in Tan et al. (2015), whieh provides RS based estimators 
of the asymptotie varianee of u and fj in the speeial ease when a = (1, d). With this partieular 
ehoiee, u{x; a, d) and v^k](^x-, a, d) in ( |3.2| ) beeome free of d leading to independenee among 
eertain quantities. However, one ean set a = w * (1, d) for any user speeified fixed veetor w, 
whieh allows the expression in (2.18) of Tan et al. (20151 to be free of d and thus the neeessary 
independence. Hence, their RS estimator can also be applied to an arbitrary vector a (details are 
given in Appendix]^. 

Remark 4. A sufficient condition for the moment assumptions for u in Theorems and is that. 


for any vr e H, sup^. |7r(a;) j as7rs(a;)| < cxd. That is, in any given direction, the tail of at 
least one of {tTs, s = 1,..., fc} is heavier than that of tt. This is not hard to achieve in practice by 
properly choosing {tTs} with regard to 11 (see e.g. Roy 2014l. Further, if < oo then 

the moment assumptions for are satisfied. 
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4 Toy example 


In this section, we employee a toy example to confirm that both the BM and the RS estimators 
are consistent, as well as demonstrate the benefit of allowing general weights to be used in the 
generalized IS estimator. Let denote the t-distribution with degree of freedom r and central 
parameter /r. We set 7ri(-) = and n 2 {-) = z^2(-)’ which are the density functions for a 
and respectively. Pretending that we do not know the value of the ratio between 

the two normalizing constants, d = m 2 /mi = 1/1, we estimate it by the stage 1 estimator d from 
sectionand compare the BM and the RS method in estimating the asymptotic variance. As for 
the stage 2 estimators from section the choice of weight and performance of the BM and the 
RS methods in assessing estimators’ uncertainty are studied in Appendix]^ 

We draw iid samples from tti and Markov chain samples from 1^2 using the independent 
Metropolis Hastings algorithm with proposal density fs 1 . It is simple to show inf^ 


*5,0(3;) 


> 


0, which implies the algorithm is uniformly ergodic (Mengersen and Tweedie (1996) Theorem 
2.1) and hence polynomially ergodic and geometrically ergodic. For RS, our carefully tuned 
minorization condition enables the Markov chain for 112 to regenerate about every 3 iterations. In 
contrast, the BM method proposed here requires no such theoretical development. 

We evaluated the variance estimators at various sample sizes with different choices of weight. 
Figure displays traces of the BM and the RS estimates of the asymptotic variance of d, in 
dashed and solid lines, respectively. Overall, the BM and the RS estimates approach the empir¬ 
ical asymptotic variance as the sample size increases, suggesting their consistency. Due to the 
frequency of regenerations, BM estimates are generally more variable than RS estimates. Fur¬ 
ther, the left panel of Figure [^is for estimators based on the naive weight, a = (0.5, 0.5), that is 
proportional to the sample sizes; and the right panel is for estimators based on a = (0.82,0.18), 
that emphasizes the iid sample more than the Markov chain sample. Indeed, the latter weight is 
a close-to-optimal weight obtained with a small pilot study (see Appendix]^ for details). Using 
such a method to choose weight can lead to big improvement in the efficiency of d if the mixing 
rate of the multiple samples differ a lot. 


5 Bayesian spatial models for binary responses 

In this section, we analyze a root rot disease dataset collected from a 90-acre farm in the state 
of Washington ( Zhang| 20021. All computations are done in R, using the package geoBayes 
( |Evangelou and Roy , 2015). Recorded at M = 100 chosen sites are the longitude and the latitude 
Si, the root counts and the number of infected roots y{si),i = 1,..., M. Of interest is 
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Figure 1: Plots of BM and RS estimates of the asymptotie varianee of d in stage 1 for 100 
randomly ehosen replieations. The left panel is based on the naive weight, = (0.5, 0.5) 
and the right panel is based on a elose-to-optimal weight, = (0.82,0.18). Horizontal lines 
represent the empirieal asymptotie varianee of d obtained over all replieations. 


a map of the disease rate over the entire area for preeision farming. We eonsider the following 


spatial generalized linear mixed model (SGLMM), similar to that used by Zhang (2002) and Roy 


et al. (20161. Taking £(sj) and s* as fixed, let 


l/(si)| 2 :(si) ~ Binomial (£(«*), $ (z(si))) ,i = 1,... ,M. 

Here 2 = (^(si),..., z(sm)) is a veetor of latent variables, whieh is assumed to be a subveetor 
of a Gaussian random field (GRF) {2^,3 G S}, that has a eonstant mean /r, and a eovarianee 
funetion 

Cov (z(s),z(s')) = - s'll) +u:o^Is{s'). 

Here, is the partial sill, || ■ || denotes the Euelidean distanee, and is a eorrelation funetion 

from the spherieal family with range parameter (j). That is, Pfp{u) = 1 ~ | for 

u G (0,0). Next, Is{s') is an indieator that equals 1 if s = s', and equals 0 otherwise. Finally, 
is the nugget effeet, aeeounting for any remaining variability at site s sueh as measurement error, 
while 00 G is the relative size of the nugget to the partial sill. Following Roy et al. (20161 
we assign a non-informative Normal-inverse-Gamma prior to (/i, cr^) whieh is (eonditionally) 
eonjugate for the model. Speeifieally, 

1 


~ N(0,100(7^), and oc (cr^) 




exp 


2a2 
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Assigning priors for h = (0, u) in the correlation function of the Gaussian random field is usually 
difficult, and the choice of prior may influence the inference ( Christensen[ 20041. Hence we 
perform a sensitivity analysis focusing on obtaining the Bayes factor (BF) of the model at h 
relative to a baseline ho for a range of values h e V.. Note that for a fixed h = the 

Bayesian model described above has parameters ip = (/r, a^). Conditioning on the observed data 
y = (?/(si), ..., y{sM)), inference is based on the posterior density 


T^hi'iPly) = 


rrihiy) 


(5.1) 


where Lh{ip\y) = fiy\z)fhiz\'i/j)dz is the likelihood, Tr{'ijj) is the prior on tp, and mh{y) = 

Itix-r+ ^hi'ip\y)TTi'ip)d'ip is the normalizing constant, also called the marginal likelihood. The 

Bayes factor between any two models indexed by h and ho is mh{y)/mhQ{y), and the empirical 

Bayes choice of h is argmaxm/i(y) = argmax [rrihpy)/mhQ{y)]- Our plan is to get MCMC 

h&H hen 

samples for a small reference set of h, to estimate the BF among them using the reverse logistic 
estimator, and then get new samples to estimate {mh{y)/mh'{y), h e V.} using the generalized 
IS method. Below, we describe the MCMC algorithms and the practical concern of how long to 


run them, which illustrates the importance of calculating a SE. 


The two high-dimensional integrals lend the posterior density in (5.1) intractable. But there 


exist MCMC algorithms to sample from the augmented posterior distribution, that is 


7ih{'ip,z\y) 


f{y\z)fh{z\'ip)7i{pj) 

rrihiy) 


(5.2) 


Note that j^Tihipr, z\y)dz = Tihi'ip\y)- Hence, a two-component Gibbs sampler that updates pj 
and z in turn from their respective conditional distributions based on ( |5.2| ) yields a Markov chain 
{^ph\ with stationary distribution nhiip, z\y). As a result, the marginal is also 

a Markov chain with stationary distribution Whiiply) ( |Tanner and Wong| 19871. 

As a starting point, we use a small pilot study to identify a range for h = ip,uj) that corre¬ 
sponds to reasonably large BF values. This step is carried out by obtaining the reverse logistic 
estimator of BF at a coarse grid of h values over a wide area, based on short runs of Markov 
chains. Specifically, (0, oj) G [80, 200] x [0.2, 2] and within this range the minimum BF is about 
1% the size of the maximum. To more carefully estimate BF over this range, we examine a fine 
grid H that consists of 130 different h values, with increments of size 10 for the (p component, 
and that of .2 for the w component. 

A natural choice for the set of skeleton points is {80,140, 200} x (0.5,1, 2}, with an arbitrarily 
chosen baseline at (200, 2). We first experiment with samples of sizes ni = ■ ■ ■ = ng = 500 at the 
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skeleton points (after a bum-in period of 500 iterations and a thinning proeedure that keeps one 
sample every 10 iterations), of whieh the first 80% are used in stage 1, and the remaining in stage 2 
of the generalized IS proeedure. BF estimates at all h G H are obtained, though not shown. 
Given the eurrent Monte Carlo sample sizes, it is natural to eonsider how trustworthy these BF 
estimates are. To this end, Figure|^shows the point-wise SEs over the entire plot obtained via the 
BM method. In this setting for some h, the magnitude of the SE is over 63% of the eorresponding 
BE estimate. Henee, inferenee based on these BE estimates eould be questioned beeause of the 
high eomputational uneertainty. 

Suppose we wish to reduee the relative SE to 5% or less for all h & H, then roughly 
(60%/5%)^ = 144 times as many samples would be required under the eurrent design. In¬ 
stead, we eonsider an alternative design using a bigger set of skeleton points, {80,140, 200} x 
(0.5,1,1.5, 2} keeping the baseline unehanged at (200, 0.2). In this alternative design, with sam¬ 
ple sizes rii = ■ ■ ■ = ni 2 = 500, the largest relative SE is 17%. This is almost one fourth that 
of the previous design, but the eomputing time only inereased by 40%. Aeeordingly, running the 
alternative design would aehieve the eomputing goal mueh faster. In this ease, we inerease the 
sample sizes to ni = ■ • • = ni 2 = 6000, whieh is approximately (17%/5%)^ times 500. Overall, 
the new proeess takes 2 minutes to run on a 3.4GHz Intel Core i7 running linux. The resulting 
BE estimates are shown in Eigurej^ with maximum relative SE redueed to 5.0%. Eor the sake of 
eomparison, mnning the original design for the same amount of time allows a eommon sample 
size of Ui = 8000 resulting in a maximum relative SE of 14.4%. In short, easily obtainable SE 
estimates allows us to experiment, ehoose among different designs, and perform samples size 
ealeulations. 

The simplieity of the method matters when it eomes to estimating SEs in praetiee. Using 
the BM method to obtain SE requires no extra input beyond what is needed for obtaining the 
generalized IS estimates. Indeed, as long as one ean run existing software to obtain the Markov 
ehain samples, there is no need to know the Markov transition kernels utilized in the baekground. 
Unlike the BM method, the RS method depends on identifying regeneration times, typieally 


through eonstrueting minorization eonditions for the Markov transition kernels (see Mykland 


et al. (1995) for details). Despite the faet that minorization eonditions ean be established for any 


Markov transition kernel, we demonstrate for the eurrent example the amount of effort needed 
to obtain a regeneration ean be prohibitively high. Reeall the MCMC seheme involves sampling 
from 7r/j('0|2, y) and Tih{z\ip, y) in turn. The former is a standard distribution henee easy to sam¬ 
ple from. The latter is not, and we followed [Diggle et ah ( 1998| ) that updates Zj^j = 1, • • ■ , M in 
turn, eaeh using a one-dimensional Metropolis-Hastings step that keeps invariant the eonditional 
posterior distribution of Zj given all other eomponents. Denote the transition density of these MH 
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SE of BF estimates SE of BF estimates 




Figure 2: Contour plots of the SEs (in log seale) evaluated for the BF estimates. The two plots 
are based on the original and alternative designs. Skeleton points used in eaeh design are marked 
by erosses, with the baselines marked by boxed erosses. 

steps as /i, • ■ ■ , /m, and suppressing the notations of their dependenee on y, the transition kernel 
of the Markov ehain ean be represented as 




Aeeording to a eommon method deseribed in Jones and Robert (2004), one ean build a minoriza- 
tion eondition by finding D C x M x M+, e > 0, and k{-) sueh that, 


p{z', 'll]) > eloiz, tjj) k{z', iIj') for all (z', -0') G x M x 


Further, the above eondition ean be established if 

fl{z[\z 2 r-- - , z'm-i, \z') 

>lD{z,'tp)eiki{z[)e 2 k 2 {z[,z'^) ••• eMkM{z[,--- ,z'j^)'Kh{'tp'\z') for all (z', V’O G x M x IR+, 

where the eommon term Hh on both sides of the inequality will eaneel, and henee the work 
is in finding ei, • • • , and fci(-), • • • , It’s ^^sy to see that the smaller the set D, the 

larger e = ean possibly be, where e ean be interpreted as the eonditional regeneration 

rate given D is visited. Suppose we take D to be small enough sueh that ej takes on a very 
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Figure 3: Based on samples of size 6000 of each of the 16 Markov chains, the left and the middle 
panels display a surface plot and a contour plot for the BF estimates (in log scale), respectively. 
The right panel shows the ratio of the SE to the BF estimates (in log scale), where the SEs are 
evaluated using the BM method. 


large value of 0.8 for each j, then the probability of getting a regeneration given a visit to D is 
e = (0.8)^°° 2 X 10“^°. Being overoptimistic that the Markov chain visits D with probability 

close to 1, it would still take 100 billon iterations for the chain to regenerate about twenty times, 
barely enough for the RS method to be effective. 

Using the EB estimate h of h, estimation of the remaining parameters -0 and prediction of the 
spatial random field can be done in the standard method using MCMC samples from 
(see e.g. |Roy et al.[|201^ section 3.2). Thus we can produce the root rot disease prediction map 
similar to that in Roy et al. ( 2016[ Web Eig. 10). 


6 Discussion 

In this paper we consider two separate but related problems. The first problem is estimating the 
ratios of unknown normalizing constants given Markov chain samples from each of the k > 1 
probability densities. The second problem is estimating expectations of a function with respect 
to a large number of probability distributions. These problems are related in the sense that gen¬ 
eralized IS estimators used for the latter utilize estimates derived when solving the first problem. 
The first situation also arises in a variety of contexts other than the generalized IS estimators. 

Eor both problems, we derive estimators with flexible weights and thus these estimators are 
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appropriate for Markov chains with different mixing behaviors. We establish CLTs for these 
estimators and develop BM methods for consistently estimating their SEs. These easy to calculate 
SEs are important for at least three reasons. Eirst, SEs are needed to assess the quality of the 
estimates. Second, our ability to calculate SEs allows us to search for optimal weights a for both 
stage 1 and 2. And last but not least, SEs form the basis for comparison of generalized IS with 
other available methods for estimating large number of (ratios of) normalizing constants. 

Although we compare BM and RS in this paper, spectral estimators can also be derived for 


variance estimation using the results in Vats et al. (2015b). However, estimation by spectral 


methods is generally more expensive computationally. Eurther, Elegal and Jones (2010) compare 
the performance of confidence intervals produced by BM, RS and spectral methods for the time 
average estimator, and they conclude that if tuning parameters are chosen appropriately, all these 
three methods perform equally well. Control variates can be used to further improve the accuracy 
of our generalized IS estimators (Doss 2010 Owen and Zhou 2000). A direction of future 
research would be to establish a BM estimator of the SEs for control variate based methods. 


A Proof of Theorem |T] 


In the appendices, we denote |Doss and Tan| ( |2014| ) by D&T. The proof of the consistency of 
d follows from D&T section A.l and is omitted. Establishing a GET for d is analogous to 
section A.2 of D&T, but there are significant differences. Below we establish the GET for d and 
finally show that E is a consistent estimator of V. 

We begin by considering — <^q). As before, let V represents the gradient operator. We 

expand at C around <^q, and using the appropriate scaling factor, we get 


- Co). 


(A.l) 


where C* is between ^ and Co- Consider the left side of ( |A.1[ ), which is just V('n(Co)o 

since V£n(C) = 0. There are several nontrivial components to the proof, so we first give an 
outline. 

1. Eollowing D&T we show that each element of the vector n~^Vin{Co) can be represented 
as a linear combination of mean 0 averages of functions of the k chains. 

2. Based on Step 1, applying GET for each of the k Markov chain averages, we obtain a GET 

for the scaled score vector. In particular, we show that tn{Co) -^(0) where 


Vt defined in (2.8) involves infinite sums of auto-covariances of each chain. 
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3. Following Geyer( 19941 it can be shown that—n ^V^£n(C*) and that (—n ^V^£n(C*)) 

B\ where B is defined in ( |2.7[ ). 

4. We conclude that — Co) -^(0, 


5. Since d = g{Co) and d = g{C), where g is defined in ( |2.4[ ), by the Delta method it follows 
that n^/^{d - d) A V) where V = B^B^QBW. 


We now provide the details. 

1. Start by considering n~^V£n(Co)- r = 1,..., fc, from D&T we have 
dL(Co) 


ni 


dCr 


= Wr 


^(1 -p^(Xf\Co)) 


i=l 


1=1 2=1 


(can be shown to) = Co) “ [l “ Co))] ) (A.2) 


2=1 

k ni 


E E Co) - E„ (pAX, Co))] . 


/ = ! 2=1 

l^r 


That is, ( |A.2[ ) can be used to view n ^c)£„(Co)/^Cr as a linear combination of mean 0 
averages of functions of the k chains. 

2. Next, we need a CLT for the vector V£n(Co) = (^^n(Co)/^Ci)''' ) ^^n(Co)/^Cfc)^> that is, 
to show that £n{Co) A iV(0, fl) as n — )■ cx). Note that. 


l_ <9C(Co) 

n dCr 


ni 


E "-i E [Po(-E\ Co) - E„ (pAX, Co))] 


= 1 i=l 

k I - ni 


K I - .. Ill 

- E E Co) - -B., {pAX, Co))] 

k 


1=1 


where Fh-O ;= T YlT=i and is as defined in ( |2.6[ ). Since Pr{x, C) £ (0,1) for 
all X, r and C, we have (|pr(X, Co) “ E^^^iPriX, Co))P^^) < for any 5 > 0. Then 
since is polynomially ergodic of order m > 1, we have asymptotic normality for the 
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univariate quantities (see e.g. Corollary 2 of Jones (2004)). Since ni/n —)■ si for 

I = 1,..., k and a^’s are known, by independence of the k chains, we conclude that 

diniCo) _d^ 7V(o, as n —)■ oo, 


\/n dCr 

where 12 is defined in ( |2.8| ). Next, we extend the component-wise CLT to a joint CLT. 
Consider any f e (fi, ■ ■ ■ , 4) G we have 

. 1 dUCo) , , . 1 diniCo) 

dCi dc. 


y(i.O Y 

'^{tiy/nai ^=^ ^ — + ■ ■ ■ + tkVnai^=^ * 




1=1 

k 


ni 

I + 


ni 




Eti [tiyr’ + ■■■+tkY, 


{k,i) 


i/Yi 


—)■ A/'(0, as 77. —)■ cxD. 


Hence, the Cramer-Wold device implies the joint CLT, 

ri“'/'V4(Co)^Ar(0,fi) 


as n —)■ cxD. 


(A.3) 


Steps 3-5 are omitted since the derivations are basically the same as in D&T. 

Next we provide a proof of the consistency of the estimate of the asymptotic covariance 
matrix V, that is, we show that V = V = D^B^QB^D as n —>■ oc. Since 

C <^o d d, it implies that D D. From D&T, we know that B B and using 
the spectral representation of B and of U, it follows that B^. 

To complete the proof, we now show that 12 12 where the BM estimator D is defined in 

( |2.13| ). This will be proved in couple of steps. First, we consider a single chain used to calcu¬ 
late k quantities and establish a multivariate CLT. We use the results in Vats et al. (2015al who 
obtain conditions for the nonoverlapping BM estimator to be strongly consistent in multivariate 
settings. Second, we combine results from the k independent chains. Finally, we show that 12 is 
a strongly consistent estimator of 12. 

Denote y(2,0^ ^ j Similar to deriving ( |A.3[ ) via the Cramer-Wold 

device, we have the following joint CLT for Wf., -4 A/'(0, as rii —)■ oo, where 

is a k X k covariance matrix with 


sS = -'’if •'>} + En{Yk‘Ylf} + F EnlY^rY)- 


(A.4) 


2 = 1 


2=1 
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The nonoverlapping BM estimator of is given in p.lO[ ). We now prove the strong eonsis- 
tency of Note that is defined using the terms whieh involve the random quantity 

We define to be with <^q substituted for that is, 


^ r\ ^ 

m=0 


y(0 _ y(0 


T 


for I = 1,... ,k, 


where Ym = 


..., ' with := Y}S^bXi We prove 

a.s. 




j=mbi+l 

in two steps: (1) and (2) 0. Strong eonsistency of 

the multivariate BM estimator requires both —)• oo and bi —?• oo. Sinee for all r, 

Eni{\Pr{XXo) - Eni{Pr{XXo))\^^^) < ^ ^r auy 5 > 0, is polynomially ergodie of order 
m > 1, and 6/ = [X} where 1 > i/ > 0, it follows from Vats et al. ( |2015a ) that (<^q) 
as ni —)■ cx). We show — eS(Co) 0 where E^s and eS(Co) the (r, s)th elements 
of the k X k matrices eS and eS (Co) respectively. By the mean value theorem (in multiple 
variables), there exists C* = fC + (1 — f)Co for some t e (0,1), such that 


£S-£?i(c„) = v£S(c)-(<-Co). 


(A.5) 


where ■ represents the dot product. Note that 


fii-i 


SS(C) = —Y El^m'HC) - z'-’--‘\C)]lzt‘HC - 

6; — i — 


m=0 


where F^’^^(C) := Ej=mfeSi C)/&« and FW)(C) := J2]LiPr{xf\C)/ni. Some calcu¬ 

lations show that for f r 




ac 


and 


= S E -P,(Af ,0). 


{m+l)bi 

E 

■ j=mbi+l 

(m+l)6; 


j=mbi+l 

We denote UY := - E^^ri^X)], ■= Z^^KC) - E^^ri^X)], and similarly the 

centered versions of dZm’'\C)/dCt and dZ^'^’’-\C)/dCt by and respectively. Since 
Pr{X, C) is uniformly bounded by 1 and is polynomially ergodie of order m > 1, there exist 
< oo such that XkUY 4 iV(0, cr^), 4 N{0,X), VbiVi"’'^ 4 Ar(0,r5), and 
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asW(C) 


dCt 


3T - uiVbiivt^^ - A(t^^ - ^^)] 


e; - 1 


m=0 


e^-l 


e; - 1 


m=0 


rr E v^c^™Av'j.*’'' + VbiVt'’''Vbfi; 


ei-1 I 




It is easy to see that the negative term in the above expression goes to zero as e; —>■ cx). Further, 
sinee 




we have 


ej-l 


ez - 1 




771=0 




2ez-l 


m=0 


771=0 


where the last step above is due to strong eonsisteney of the BM estimators for the asymptotie 
varianees of the sequenees {pr{Xj’'\C)-,j = {dps{Xj’‘\C)/dCf,j = 1, •'' 

respeetively. Similarly, we have 


ej-l 


—r E VbiU‘ 


ez - 1 


771=0 


1 1 

<-4tE 


2ez-l 


771=0 




r,t)\2 


+ X 


1 1 


ej-l 


2ez-l 


rr E HvLf] 


771=0 


a-s.^ I -^^2 

2 ^’*^2 "■ 


Note that the terms U^Vm’^\ 4,4^, etc, above actually depends on C, and we are indeed con¬ 
cerned with the case where C takes on the value C*, lying between ^ and Co- Since, C Co’ 
C* Co as nz —)■ cxD. Let ||m|| denotes the Li norm of a vector m G So from ( |A.5| ), and the 
fact that dTir} {C)/dCt is bounded with probability one, we have 

a£S(C‘ 


PS - eS«o)| < 


max 

l<t<A; 


dCt 


IC-Cc 


0 as n —)■ oo. 


Since for I = 1,... ,k, it follows that E E where E is defined in ( |2.11| ) 

and E is the corresponding k'^ x k"^ covariance matrix, that is, E is a block diagonal matrix as E 
with E^^) substituted for E*^*C I = 1,..., k. Since rii/n —)■ s; for Z = 1,..., fc, we have —)■ Ag 


as n —)■ oo where An is defined in (|2.12|) and 

A, = 


— o-ih 

Si 


— 0-2lk 
S2 


(^k^k I • 
Sk 


Finally from ( |2.8| ) and ( |A.4| ) we see that f2 = A^EAj. So from ( |2.13| ) we have f2 = A„EA^ 
A^EAT = as n — )■ oo. 
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B Proof of Theorem |2| 


As in Buta and Doss (2011) we write 


y/niiiij, TTi; a, d)—u{7i, tti)) = y/niuijr, tti; a, d)—u{n, tti; a, d))+y/n{u{'K, tti; a, d)—u{7i, vti)). 

(B.l) 

First, eonsider the 2nd term, whieh involves randomness only from the 2nd stage. From ( |3.3| ) 
note that Yl’i=i (^iEniU{X; a, d) = u{7i, tti). Then from ( |3.1[ ) we have 


^ - E^^u{X;a,d)) 

n{u{'n-,7ii; a, d) — u{'n-, Til)) = y an ' 


1=1 




Sinee is polynomially ergodie of order m and \u{X] a, d)\^^° is finite where m > 1 + 2/5, 
it follows that '■> d) — ETriU{X] a, d))/^Jrii -4 iV(0, r/(7r; a, d)) where r/(7r; a, d) 

is defined in ( |3.4| ). As ni/n —)■ s/ and the Markov ehains are independent, it follows that 
^/n{u{7r, TTi; a, d) — M(7r, TTi)) A iV(0, r^(7r; a, d)). 

Now we eonsider the 1st term in the right hand side of ( |B.1[ ). Letting E{z) = u{7i, tti; a, z), 
by Taylor series expansion of E about d we have 


^ A / 77 ^ 

V^{E{d) - E{d)) = y^VF(d)^(d -d) + E—(d - d)^V^F(d*)(d - d), (B.l) 


where d* is between d and d. Simple ealeulations show that 

a,;/,(+'+(+'>) 


k ni 


ir (e:., a.^,{xP)/dM 


[c(vr; a, d)]j_i 


(B.3) 


where [c(7r; a, d)]j-i is defined in ( |3.5[ ). We know that n/N — q. Using similar arguments as 
in 


Buta and Doss (20111, it follows that X^E{d*) is bounded in probability. Thus from ( |B.2[ ) we 


have 


Vn(,F{d) - F{d)) = ^VF(,dVVN(d - d) + “ d)|W=F(d-)[^(d - d)| 

= y/qc{7T] a, d)~^ i/N {d - d) + Op(l). 


Then Theorem |^(1) follow from ( |B.1| ) and the independenee of the two stages of Markov ehain 
sampling. 

Next to prove Theorem (2), note that, we already have a eonsistent BM estimator V of 
V. From ( |B3] ), we have [c(7r; a, [c(7r; a, Applying mean 
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value theorem on ['VF(d)]j_i and the fact that V^F{d*) is bounded in probability, it follows that 
[c(7r; a, d)]j-i - [c(7r; a, d)]j_i 0. Writing c(7r; a, d)’^l/c(7r; a, d) as YliZi Y!^=i CiVijCj, it 
then follows that c(7r; a, d)~^Vc{7i; a, d) c(7r; a, dYVc{'K] a, d). 

We now show ^ consistent estimator of a, d) where rf and Y are defined 

in p.4| ) and p.7| ), respectively. Since the Markov chains {xf^Yi independent, it then fol¬ 
lows that r^(7r; a, d) is consistently estimated by T^(7r; a, d) completing the proof of Theorem]^ 
( 2 ). 

If d is known from the assumptions of Theorem]^ (2) and the results in Vats et al. (2015a), 
we know that YY', d) is consistently estimated by its BM estimator YY', d). Note that, 
YY'i d) is defined in terms of the quantities u{xY', cl, <i)’s. We now show that YY'i d) — 
fYn;a,d) ^ 0. 

Denoting YY', cl, z) by G(z), by the mean value theorem (in multiple variables), there exists 
d* = td + {1 — t)d for some t G (0,1), such that G{d) — G{d) = XG{d*) ■ {d — d). For any 
j e {2, ■ ■ ■ , k}, and z G 


dG{z) 

dzj 


ei 


'ei-l 

2(Mm(a, z) - M(a, z)) 

m=0 


dum{cL,z) du{a,z] 


dzj 


dzi 


(B.4) 


Let Wm ■= UmicL, z) — Et,^{u{X] a, z)) and W := u{a, z) — a, z)). Note that, there 

exists, cr^ < oo such that YhiWm -4 N(0,cr^), and ^/niW A- N(0,cr^). Simple calculations 
show that 


dUm{cL, Z) ttj 1 


dzi 


{m+l)bi 
J i=mhi-\-\ 


z? k 


ujxYMxn 


(0^ 


Hence, letting aj = E^Jz/(X)z/j(X)/ asi^si^)/Zsf], we write 

duyn{ci',z) du{a,z)^a^.- ^ ^ \ 


dzi 


dzi 


z. 


J 


-J 

r,v fv(d\ /rv- ^ ,, fvii)) 


where = ( 1 / 6 ;) )/{!]« a^i^siXl AAsY] - oij and Zj is similarly de¬ 
fined. Note that, there exists rf < oo, such that y/FiZ^j A N(0, r|), and ^(0; AA 
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From (IB .41) we have 


ei-l 


^ =3 E [ - »') Vi. ( 

^ m=0 

=3^ E ['/b.W^Vb.Z,^., 


m=0 


aj , 
-^26, 


Qj 2 


S - 1 „.o 


1 _ 1 
Zj —7 V + W —- V 

~ ^ ei — 1 ei — 1 

^/nlW^/nlZj 


ei-l 


Y, [A^m A^m,. 


a,' 2 


Then using similar arguments as in the proof of Theorem it ean be shown that dG{z)/dzj is 
bounded with probability one. Then it follows that 


\G{d)-G{d)\ < max 


dG{dr 


dZn 


\d-d\\ 


C Proof of Theorem |3| 


As in the proof of Theorem]^ we write 

^/n{^f^\^l]a,d)-E^f) = y/n{Tf^\Ti]a,d) -Tf^\'K]a,d)) + y/n{fi^^\Ti]a,d) - E^f). (C.l) 
First, eonsider the 2nd term, whieh involves randomness only from the 2nd stage. Sinee 

^^ash's{x)/{ms/mi) mi 


^^a,T;.,uW(X;a,d)= [ ^ 
1=1 2x L.=i 


we have d) = Et^/u^tt, tti). Then from ( |3.1| ) we have 


n 


i)W(7r; a, d) - E^fu{Tr, tti) 


^ V ^ V u{xf^- a, d) - E^^u{X; a, d) 

(C.2) 


-u(7r, TTi; a, d) — u{7r, tti) 

From the eonditions of Theorem and the faet that the Markov ehains = 1,... ,k are inde¬ 


pendent, it follows that the above veetor (C.2) eonverges in distribution to the bivariate normal 


distribution with mean 0 and eovarianee matrix r(7r; a, d) defined in p.9| ). Then applying the 
Delta method to the funetion g{x, y) = x/y we have a CLT for the ratio estimator a, d). 
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that is, we have y/n{fj^-^\7;]a,d) — E^^f) A- iV(0, p(7r; a, d)) where p{n]a,d)) is defined in 
Next letting L{z) = (tt; a, z), by Taylor series expansion of L about d we have 

V^(L(d) - L(d)) = v^VL(d)^(d-d) + ^{d - dyv^L{d*){d - d), (C.3) 

where d* is between d and d. Simple calculations show that 

[VL{d)]j_i = [e(7r;a,d)]j_i ^ [e(7r; a, d)]j_i (C.4) 


where [e(7r; a, d)]j_i and [e(7r; a, d)]j-i are defined in p.l 1| ) and p.l2| ) respectively. It can be 
shown that V‘^L(d*) is bounded in probability. Thus from ( |C.3| ) we have y/n{L{d) — L{d)) = 
y/qe{7r] a, d^y/N{d — d) + Op(l). Then Theorem[^(l) follow from ( |C.1[ ) and the independence 
of the two stages of Markov chain sampling. 

Next to prove Theorem[^(2), note that, we already know that V is a consistent BM estimator 
of V. From ( |C.4[ ), we have [e(7r; a, d)]j_i [e(7r; a, d)]j_i. Applying mean value theorem on 
[VL(d)]j_i and the fact that V‘^L{d*) is bounded in probability, it follows that [e(7r; a, d)]j_i — 
[e(7r; a, d)]j_i 0. 

From p.8[ ) we know that fi(7r, tti; a, d) M(7r, tti). From p.l3[ ) we know a, d) 

Et,/. SinceTTi; a, d) = a, d)M(7r, tti; a, d), it follows that tti; a, d) -^A 

E.„fu{n,7ri). Thus TTi; a, d),-u(7r, TTi; a, d)) Vh{E.„fu{n,ni),u{Tr,7ri)). Thus 

to prove Theorem 1^(2), we only need to show that ri(7r; a, d) ri(7r; a, d). 

If d is known from the assumptions of Theorem|^(2) and the results in Vats et al. (2015al, 
we know that r;(7r; a, d) is consistently estimated by its BM estimator r;(7r; a, d). We now show 
that ri(7r; a, d) — r;(7r; a, d) 0. 

From Theorem 1^ (2), we know that a, d) — a, d) 0. We now show 




■Ji'A', a, d) — d) —4 0. Letting 7 /H^j A by HA)y by the mean value theorem, 

there exists d* = td+ (1 —t)d for some t G (0,1), such that E[{d) — H{d) = VH{d*) ■ (d — d). 
For any j G {2, ■ ■ ■ , k}, and z G 


dHA) 

dzj 


ei-l 


A-i 

E2(* 

m=0 


[/], 


a, z) — z)) 


dv. 


[/], 


a,z 


^-y[/]( 


a,z 


dzi 


dzi 


Let wA ■= Vm\a,z) — E^.^A^'^K^'y A) := v^-^^{a,z) — a, z)). Note 

that, there exists, aj<oo such that AbiWA A N(0,(Tj), and A N(0, cr|). Simple 
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calculations show that 


dvll\a,z) _ ttj 1 


dzi 


/(xfV(xfV,(xf^) 

)/2;. 


Hence, letting = E^^[f{X)u{X)uj{X)/{J2s we write 


dvll\a,z) du^f^{a,z) _ aj [-[/] 




where Z'^! = (1/6,) Eti[/(-’fr)''(-’fr)-'l(At’)/{E, a.-'.(A'.“')/e.}"l - <' and Zf is sim- 
ilarly defined. Note that, there exists sueh that y/FiZ^j A N(0,r|j), and ^/fTiZj -A 

N(0, T^)- The rest of the proof is analogous to Theoremj^ in that we have 

r> 

y/niW^-^^y/niZ^^^ 


dH{z) 


dz-i 


'^4 -*- f. ^ 

J m=0 


ttj 2 


Then it eanbe shown a, d)—'yl^{Tr; a, d) 0 and finally 7 /^(vr; a, d)—'yP{7i; a, d) 

0 . 


D Regeneration with general weights 


Tan et ah] ( |2015| ) provide a regeneration based eentral limit theorem (CLT) for the estimators fj 
and it defined in 1.3 and 3.1 respeetively in the main text. In the ease when d is unknown, they 
allow only a speeial ehoiee for the weight veetor, namely a = (1, d) for their results to hold, 
where d is the estimator of d based on the Stage 1 ehains diseussed in Seetion 2 of the main text. 
In this seetion, we establish a regeneration based CLT for fj and u with any ehoiee of the weight 
veetor a. 

We will refer to the following eonditions. 

A1 For eaeh I = 1,... ,k, the Markov ehain = {Xq\x[’‘\ ...} is geometrieally ergodie 
and has tt/ as its invariant density. 

A2 Let ki : X X X —)■ [0, cxd) be the Markov transition density for $/, so that for any measurable 
set A we have G A\Xn'^ = x) = J^ki{y\x)n{dy). Suppose that for eaeh / = 

1,..., k, ki satisfies the following minorization condition: 


ki{y\x) > si{x) qi{y) for all x,y eX, 


(D.l) 
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where the funetion s/: X —[0,1) with E^^si > 0, and qi is a probability density funetion 
on X. 


A3 Recall the functions u{X; a, d) and (X; a, d) defined in (3.2) of our paper. There exists 
e > 0 such that (X; a, d)and E^.^\u{X; a, d)\‘^~^'^ are finite. 

A4 Suppose is simulated for Ri regenerative tours for / = 1,..., fc. Assume Ri/Ri ^ bi e 
(0, oo) as Ri —)■ oo. 


Following |Tan et al.| ( |2015[ ), let the regeneration times for the Markov chain be = 
0, tP,T 2 \ _ Accordingly, the chain is broken up into “tours” { (X (z) ,..., X (o A, t = 

''' ’"t-i ''"t 

1,2,...} that are independent stochastic replicas of each other. Suppose we simulate Ri tours of 
the Markov chain for / = 1,..., fc, so the length of the chain is ni = t^\ Also as in 


Tan 


et al. (2015), for f = 1,2,..., i?; define 


r«-l 


rP-1 


r«-i 


k'’= Y1 v'^X‘'';a,d), Y1 and r,''> = 1 




r(0 _ 


( 0 . 


"'(0_ 


_ ^(0 


SI) 


= R' - 


(0 

*=A-i 


(0 

-J- ' 


(D.2) 

where the sums range over the values of i that constitute the t* tour. 

Recall from Remark 4 in Section 3 of our paper, when d is unknown, we set a = w * (1, d) 
where * denotes component-wise multiplication. That is, (oi,..., a*,) = (wi, W 2 , ■ ■ •, Wk) * 
(1, ^ 2 ,..., dfc) for any pre-determined weight w. With this choice, the expressions for u and 
in (3.2) become 


u 


(x; w * (1, d),d) = 


/WX. (D.3) 


The above quantities do not involve d, and consequently for each /, the triples , (7®, , t = 

0,1, 2,... defined in (|D.2[) are iid, and we have independence across Vs. The estimator for r] re- 
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duces to 


widi 


ni 


V = flN,n{'W * (l,d),d) = ^ ^ 


/(xfV(^r) 

.(/) 


(0^ 


A; j 

r)7 




fcr•fctE:.i»i<^.(Ar)/ tr tt e:., > 


(D.4) 






r(0 


rij 

1=1 ‘ z=i 


ni 

1=1 ‘ z=i 


2^Widij^ }_^Widi 
1=1 ' 1=1 


Td) ’ 


where 


r«-l 


f/<-> = 5: 


v(Xl 


(0^ 


r«-l 


and 


•_ (0 




(O' 


/(■Yi'V(.Y|'0 


(O' 


. (0 


(O' 


^ Tf'''' be the average tour length and, analogously, ^ 

Ud'> = R~^ '^fh Ut’'\ Similarly, the estimator for m/rrii reduees to 

^ Widt s= jM) ; t''** 

1=1 ^ t=l 1=1 


‘Td) ■ 


Theorem 4 below gives the asymptotie distributions of i) and u. It extends Tan et al. s (20151 
Theorem 2 to the general ehoiee of weight veetor a. To state the theorem, we first need to define 
some notation. Let M and L be the veetors of length k — 1 for whieh the (j — 1)* eoordinates 
are, for j = 2,..., k, 

Mj-i = WjE.„.u and 

{Y=iU!idiE^^v'^d]'j(^yj.E^,u) (D.6) 


Lj-i — 


Y^i^-^widiE^^u 


{YA=iWidiE^iUy 


As in Tan et al. (2015), assume that in Stage 1, for / = 1,..., fc, the ehain has been run for pi 
regenerations. So the length of the ehain, Ni = + ... + is random. We assume that 

pi,..., Pfc —)■ cxD in sueh a way that pi/pi ci G (0, cxd), for Z = 1,..., fc. 
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Theorem 4 Suppose that for the Stage 1 chains, conditions A1 and A2 hold, and that for the 
Stage 2 chains, conditions A1-A4 hold. If pi ^ oo and Ri ^ oo in such a way that R\/pi —>■ 
g G [0, oo), then 


R\'‘^ {u - m/mi) 4 Ar(0, qM^WM + k^) 


and 

R}^^ {p-ri) a Ar(0, qUWL + r^), 

where M, L are given in equations ( |D.6[ ), W, rf and are given in equations (2.15), (2.8), and 
(2.10) oj \Tan ef^. ( 2015^ , respectively. In their (2.8) and (2.10), a is taken to be a = w * (1, d). 
Furthermore, we conform strongly consistent estimates of the asymptotic variances if we use W, 
if^, and defined in (2.16) and (2.11) of Tan et al. (2015), respectively, and use the standard 
empirical estimates of M and L. 


D.l Proof of Theorem 4 

We first prove the CLT for q. Note that 


Rf [fj{w*{l,d),d)-p] = Rf [fj{w*{l,d),d)-fi{w*{l,d),d)]+Rf [fi{w*{l,d),d)-p]. 

(D.7) 


The seeond term on the right side of (D.7) involves randomness eoming only from Stage 2 


sampling, and its distribution is given by Theorem 1 of Tan et al. (2015): it is asymptotieally 
normal with mean 0 and variance r^. The first term involves randomness from both Stage 1 and 
Stage 2 sampling. However, as in the proofs of Theorem 2 and 3, we can show that for this term, 
the randomness from Stage 2 is asymptotically negligible, so that only Stage 1 sampling con¬ 


tributes to its asymptotic distribution. Finally, the asymptotic normality of the left side of ( |D.7| ) 
follows since the two stages of sampling are independent. We now provide the details of the 
proof. 

Consider the first term on the right side of (|D.7[). Recall that if a = w * (1, d), then 


:= a, d) = — u{x) := u{x-,a,d) = 


u(x) 






With (D.4) and (D.51 in mind, define the function 


k ni 

WlZl 


k ni 

WlZl 


A(z) =,](«.»(1. -). -) = E ^ ) / E T E “(-E) 

ni “ / “ ni ^ 

i=i 1=1 i=i 1=1 

for z = (z 2 , • • •, zfA, with zi > 0 for / = 2,..., fc, and Zi = 1. Note that setting z = d gives 
A(d) = t)(w * (1, d), d), and setting z = d gives A(d) = q(w * (1, d), d). 
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By a Taylor series expansion of A about d we get 

d),d)-fi{w*{i, d), d)] = Ry\A{dy{d-d)+^—{d-dyv^A{d*){d-d) 


pl/2 

= RfVA(dy(d -d) + ^{p[''\d - d)yV^A{d-){pf(d - d)), 

zpi 

where d* is between d and d. As -Ri —)■ cxd, n; —)■ oo for each 1. We first show that the 
gradient VA{d) converges almost surely to a finite constant vector by proving that each one of 
its components, [A{d)]j_i, j = 2,... ,k, converges almost surely as Ri —oo. As n/ —)■ oo for 
/ = 1,..., fc, for j = 2,..., fc, we have 

(.nMEZAW') 




ZlM’id,/n,)j:Z,n{xP) 

£"1, >)) {(w,/ni) ES, u(xP)) 

{ElMd,/n,)EZMxP))" 

a.s. WjEyrjvd^ (YlZi^‘diE^ydl)(wjE^^u) 

RUi'^idiE^^u ZZlZi'^^idiEniuY 

The expression above corresponds to Lj-i, which is defined in ( |D.6| ), and it is finite by as¬ 
sumption A3. Next, we show that the random Hessian matrix V‘^A{d*) is bounded in prob¬ 
ability, i.e., each element of this matrix is Op(l). As —)■ oo for I = for any 

j, f G {2,..., k}J ^ t, we have 

[V A{cl — - 






^ i=l 


U)- 




L(Ef..^El.«(Af))^ 

y^tE^,vd] 


_ 2 Ct. ^ ES. I'l'Ev'")) (iff E”l. «(A'‘’)) 


{YlLi {wtE^.u) 

{YA=i^idiE-„iuY 


{wjE v^f^){wtE^,u) 
-^ - [WjE^M] 


iJ2Ll'^l(^lEniUy 


- 2 


{YlLl'^l^lEniUY 

where the limits are also finite. 

Now, we can rewrite ( |D.7| ) as 

Ry‘^[fi{w* (l,d),d) -rj] = {Ri/A{dyp^^^d - d) + Ry‘^[fj{w* {l,d),d) - p] 

+ E_(ii,/p,)i/2 [p|/=(d - d)] ^ vM(d*) [pf (d - d)] 

2p/ 

= d) -f Ry^[p{w * (l,d),d) - p] + Op(l) 
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Since from 


Tan et al. 


(20151 we have p\^‘^{d — d) A A/'(0, W) and the two sampling stages are 
assumed to be independent, we eonelude that 

[fi[w * (1, d), d) — r]\ A- A/'(0, qL^WL + r^). 

The proof of the CLT for u is similar. As in ( |D.7[ ), we have 

[u[w * (1, d), d) — m/mi] = R^^ [-u(m * (1, d), d) — u{w * (1, d), d)] 

+ R^"^ \u{w * (1, d), d) — m/mi]. 

The asymptotie distribution of the seeond term in ( |D.8| ) is given in Tan et al. s (20151 Theorem 
1. The first term is linear in d — d: 


(D.8) 


u 


i=2 


(m* (l,d),d) -fi(m*(l,d),d) = ^ ^ ^ (^ - A). 


i=l 


(D.9) 


For j = 2,k, the eoeffieient of {dj — dj) in ( |D.9[ ) eonverges almost surely to WjE.„.u, whieh 
is the term Mj_i defined in ( |D.6[ ). 

Finally, from the independenee of the two terms in ( |D.8[ ) we eonelude that 

Ryyu{{l,d),d) -m/mi] 4 A^( 0 , + A). 


E Toy example 

In this seetion, we follow up on the simulation studies that involve t distributions from Seetion 4 
of the main text to verify Theorems 1-3. We also diseuss different weights in forming generalized 
IS estimators and their effeets on estimates of expectations and ratios of normalizing constants. 

Let denote the t-distribution with degree of freedom r and eentral parameter p. We con¬ 
sider 7ri(-) and as the density functions for a and f 5 ,^ 2 =o, respectively. For simplieity, 

let r'i(-) = 7ii{-) for i = 1, 2. Our plan is to first estimate the ratio between the two normalizing 
eonstants, d = m 2 /mi. Then we will study a sea of ^-distributions If = : /i G M} where 

M is a fine grid over [0,1], say M = {0, .01, • • ■ , .99, 1}. For eaeh p G M, we assume that 
= 7r^(-) and we estimate the ratio between its normalizing eonstant and mi, denoted by 
4 estimate the expectation of each distribution in If, denoted Et^^X or E^X 

for short. Clearly, the exaet answers are d = d^ = 1 and E^X = p for any p G M. Nevertheless, 
we follow the two-stage procedure from Sections 2 and 3 to generate Markov ehains from tti 
and tt 2 and build MCMC estimators from Theorems 1-3. The primary goal is to eompare the 
performanee of BM and RS estimators. 
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We draw iid samples from tti and Markov chain samples from 112 using the so called inde¬ 
pendent Metropolis Hastings algorithm with proposal density For RS, we follow the idea of 
Mykland et al. (|1995| Section 4.1) on constructing minorization conditions to identify regenera¬ 


tion times. Based on a carefully tuned minoration condition, the Markov chain for 112 regenerates 
about every 3 iterations on average. In contrast, for users of the BM method proposed in this pa¬ 
per, no such theoretical development is needed. For i = 1, 2 we draw Ni observations from vr* in 
stage 1 and n* observations from vTj in stage 2. We set Ni = N 2 and ni = ^2 = iVi/10 = iV2/10. 
Recall the reason for smaller stage 2 sample sizes is due to computing cost. For completeness, 
note generating Markov chain samples using RS results in a random chain length so these chains 
were run in such a way that Ni ~ N 2 and rii ~ ^ 2 . 

For estimators based on stage 1 samples. Theorem 1 allows any choice of weight, For 
estimators based on stage 2 samples. Theorem 2 and 3 allow any choice of weight, in con¬ 
structing consistent BM estimators of the asymptotic variances. RS based estimators in stage 2 
are calculated using Theorems stated in Doss and Tan (20141 and Tan et al. ( |2015 ) with a general 
weight choice noted in Remark 4. This is an important generalization in that now any non¬ 
negative numerical weight vector can be used. We discuss the choice of weights and their impact 
on the estimators later in this section. 

The following details the simulation study presented in the main text. We consider increasing 
sample sizes from A^i = 10^ to 10® in order to examine trace plots for BM and RS estimates. 
The two stage procedure is repeated 1000 times independently. The unknown true value of the 
asymptotic variance of d is estimated by its empirical asymptotic variance over the 1000 repli¬ 
cations at Ni = 10®. We consider the naive weight, = (0.5, 0.5), that is proportional to 
the sample sizes, and an alternative = (0.82,0.18) that weighs the iid sample more than 
the Markov chain sample. As illustrated in Figure 1 of the main text, both the BM and the RS 
estimates approach the empirical asymptotic variance as the sample size increases suggesting 
consistency. Similarly for stage 2, Figure shows convergence of the BM and the RS estimates 
to the corresponding empirical asymptotic variances of and E^{X). Plots for other jj, E M 
show similar results, but are not included here. 

Overall, the simulation study suggests BM and RS methods provide consistent estimators for 
the true asymptotic variance. RS estimators enjoy smaller mean squared error in most cases. 
Nevertheless, when the number of regenerations is not great, BM estimators could be the more 
stable estimator. For example, in the top left panel of Figure]^ at stage 2 sample size 77-2 ~ rii = 
100, or about 35 regenerations for chain 2, the RS method substantially over-estimated the target 
in about 5% of the replications. Further, in the cases where regeneration is unavailable or the 
number of regenerations is extremely small, then BM would be the more viable estimator. 
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Figure 4: Estimates of the asymptotie varianee of (upper panels) and (lower panels) in 

stage 2, with naive weight = (.5, .5). 


E.l Choice of stage 1 weights 


For stage 1, we reeommend obtaining a elose-to-optimal weight using a pilot study de- 


seribed in Doss and Tan (2014). In short, one ean generate samples of small size from tti and 
7r2, estimate d and its asymptotie varianee based on Theorem 1 for a grid of weights, and then 
identify the weight that minimizes the estimated varianee. With a small pilot study based on sam¬ 
ples of size 1000 from both distributions, we obtained = (0.82,0.18). As depieted by the 

horizontal lines aeeross the pietures in Figure 1 of the main text, the asymptotie varianee of the 
estimator d based on a[^’°P‘] is approximately 0.07, whieh is more than 30% smaller than of the 
estimator based on the naive ehoiee = (.5, .5). Note that the naive weight is proportional to 
the sample sizes from tti and 7r2, whieh is asymptotieally optimal if both samples were indepen¬ 
dent. However, sinee sample 2 is from a Markov ehain sample, using a weight that appropriately 
favors the independent sample has lead to smaller error in the estimator. The gain in effieieney 
using a elose-to-optimal weight will be more pronouneed if the differenee in the mixing rates of 
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the two samples is larger. 


E.2 Choice of stage 2 weights 

In stage 2, for eaeh jj, G M, the asymptotie varianee of and E^{X) are minimized at different 
weights. Instead of searching for each of the 2\M\ optimal weights in a pilot study, it is more 
practical to set sub-optimal weights using less costly strategies. Below, we perform a simulation 
study to examine three simple weighting strategies: 


1. naive: cx (ni,?7,2), 

2. inverse distance (inv-dist): oc ( )’ 


3. effective sample size (ess) by inverse distance (inv-dist): oc 


essi 


ess2 




Using each of the three strategies, we construct generalized IS estimators for and E^{X) 
for a grid of /i values between —1.5 and 4. Note that samples are drawn from two reference 
distributions indexed by /r = 1 and /r = 0 respectively. Hence our simulation study concerns 
both interpolation and extrapolation. A summary of their performance is provided in Figure 
and detailed results for selected simulation setups are shown in Figures and |^for strategies 1, 
2, and 3, respectively. Figure [^suggests that none of the three strategies is uniformly better than 
the others. In particular, we observe the following. 


1. For estimating d^ 

(a) For /i G (0,1), strategy 2 works the best. 

(b) For /r = 0, strategies 2 and 3 work better than strategy 1. Indeed, both of them simply 
set their stage 2 estimates do to be the stage 1 estimate, d. This would be a better 
choice than strategy 1 because in a two-step procedure, stage 1 chains are often much 
longer than stage 2 chains, and hence d is already a very accurate estimate for do = d. 

(c) For jj, ^ [0,1], strategies 2 and 3 generally lead to more stable estimates of d^. How¬ 
ever, all strategies lead to very large asymptotic variances for /i < 0. Hence, one 
needs to be mindful when doing extrapolation with IS estimators — always obtain an 
estimate of the standard error, or reconsider the placement of the reference points. 

2. For estimating EM) 

(a) For /i G (0,1), strategy 2 works the best in general, while strategy 3 is very unstable. 
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Figure 5: Comparisons of three weight strategies in terms of the asymptotic variance of the cor¬ 
responding estimators and E^{X). The solid dots show which strategy achieves the smallest 
asymptotic variance among the three at any given /r (ties awarded to the more basic strategy). 


(b) For either /i = 0 or 1, strategy 2 and 3 are the same, and they only utilize the reference 
chain from /i. This was a wise choice for estimating as explained before, but not 
so for other quantities of interest. 

(c) For /r ^ [0,1], all strategies lead to fairly large asymptotic variances, especially for 

/i < 0. 

Overall in stage 2, strategy 2 has an advantage when the estimands are ratios between nor¬ 
malizing constants. However, when estimating E^{X), the situation is more complicated. Our 
impression is that assigning any extreme weight will lead to high variability in the estimator. So 
it is reasonable to simply use the naive weight, or other strategies that bound the weights away 
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logic of chain 1 length in stage 2 


Figure 6: Estimates of the asymptotie varianee of (upper panels) and Efj,{X) (lower panels) in 

stage 2, with weight (/r) ehosen by strategy 2. 


from 0 and 1. 


F Bayesian variable selection models 

Here, we eonsider a elass of Bayesian variable seleetion (BVS) models for linear regression with 
independent normal priors on the regression eoeffieients. This model involves a 2-dimensional 
prior hyperparameter that infiuenees inferenee, yet no default ehoiee guarantees good perfor- 
manee in praetiee. Henee, displaying the effeet of different hyperparameter values on the poste¬ 
rior distribution would greatly benefit users of the model. When the number of predietors, q, is 
large, the eomputing is ehallenging. Our solution is to obtain MCMC samples for a small num¬ 
ber of models with different hyperparameter values, based on whieh generalized IS estimates ean 
be obtained for BFs and other posterior expeetations for a large number of models. Again, an 
important problem in praetiee is how long the Markov ehains need to be run? In this eontext, the 
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Figure 7: Estimates of the asymptotie varianee of (upper panels) and Efj,{X) (lower panels) in 
stage 2, with naive weight (/x) ehosen by strategy 3. 


only affordable method that we are aware of is to estimate the SE of these IS estimators using the 
proposed BM method. 

As introdueed by Mitehell and Beauehamp (1988), let Y = (Yi,..., denote the veetor 
of responses and Xi,... ,Xq denote q potential predietors, eaeh a veetor of length m. The pre- 
dietors are standardized, so that for j = 1 ,..., g, = 0 and XjXj = m, where Im is the 

veetor of m I’s. The BVS model is given by: 


given 7 , cT^/3o,/? 7 , 
given 7 , /3o, 

given 7 , 


Y ~ A/'m(lm/3o + a^I), 

/3j ~ Ar( 0 , ^cr^) for j = 1 ,.. . , q, 
(cr^/3o) ~ p(/3o,cr^) oc 1/cr^ 

7 ~ p{'y) = . 


(F.la) 

(F.lb) 

(F.le) 

(Eld) 


The binary veetor 7 = ( 71 ,..., 7 ^)^ G {0, identifies a subset of predietors, sueh that Xj is 


40 

























included in the model if and only if 7 ^ = 1 , and I 7 I = denotes the number of predictors 


included. So ( |F.la[ ) says that each 7 corresponds to a model given hy Y = ImPo + + e, 

where is an n x I 7 I sub-matrix of X that consists of predictors included by 7 , (3^ is the vector 
that contains corresponding coefficients, and e ~ A/"m(0, a^J). It is sometimes more convenient 
to use the notation, Y = Xq^Pq^ + e, where Xq^ has one more column of I’s than X^ and 
/^07 = (/^Oi )■ Unknown parameters are 0 = ( 7 , a, /^o, (3^) for which we set a hierarchical prior 


in ( |F.lb| ) to ( |Fld[ ). In ( |F.ld[ ), an independent Bernoulli prior is set for 7 , where w G (0,1) is a 
hyperparameter that reflects the prior inclusion probability of each predictor. In ( |F.Ic[ ), a non- 
informative prior is set for ((T^,/3 o)- In ( |F.Ib| ), an independent normal prior is assigned to (3^, 
where A > 0 is a second hyperparameter, that controls the precision of the prior. Overall, 6 is 
given an improper prior due to ( |F.lc| ) but the posterior of 9 is indeed proper. 

One can actually integrate out {(3^, l3o^a‘^) and arrive at the following model with parameter 
7 only: 


r|7~4(7;U) = 



71 


/(U|7, /5o, 4)/(/^7l7, /5o)/(c^^ (3o)d(3^d(3oda'^ 


= CmX'-^\Ao,\ ^^[{Y-Yf{Y-Y)-l3^Ao^l3^] 

7 ~ P/i(7) = 


(F.2) 


Here, Cm is a constant depending only on the sample size m. Further, Aq^ = Xq^Xq^ + Aq^, 
where Ao..y is a diagonal matrix, the main diagonal of which is the (1 + | 7 |)-dimensional vector 
(0, A, ■ ■ ■ , A). Finally, ^ = A^^^X^Y. 


Using the model at ( |F.2[ ) requires specification of the hyperparameter h = {w, A). Smaller w 
values assign high prior probabilities to models with fewer predictors, and priors with smaller A 
values allow selected predictors to have large coefficients. It is common to set w = 0.5 (a uniform 
prior on the model space) and A = 1 (a unit information prior for uncorrelated predictors, see e.g. 
Kass and Raftery| ( |l995[ )). One can also choose h adaptively, say according to the marginal likeli¬ 
hood mh = y)Ph{'l)- A small value of nih indicates that the prior ph is not compatible 

with the observed data, while /ieb = argmaxm/i is defined to be the empirical Bayesian choice 
of h. The empirical Bayes idea has been successfully applied to various models with variable 
selection components (see e.g. [George and Foster](|2000[); [Yuan and Lin| (|2005|)). However, we 


have not seen this idea being carried out for the model in (F.l), except where n = p and the de¬ 


sign matrices are orthogonal (Clyde and George (2000); Johnstone and Silverman (20051). Due 


to the improper prior in ( |F.ld| ), rrih is not uniquely defined. Nevertheless, the Bayes factor among 
any two models, say 171 ^/ 171 ^ 1 , is well-defined because the same improper prior is assigned to the 
shared parameters of the two models (see e.g. [Kass and Rafter^ ( |1995| sec.5) and [Liang et al. 
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(2008 sec. 2 )). 


Here, we concentrate on two goals. The first is to evaluate [mh/mh^, h G H}, the marginal 
likelihood of model h relative to a reference model hi, which allows us to identify the empirical 
Bayesian choice of h. The second is to evaluate the posterior mean of the vector of coefficients 
for each h G T-L, which we denote by b^. Predictions can then be made for new observations 
using 

For model ( |F.2[ ) with a fixed h, a Metropolis Hastings random-swap algorithm ([Clyde et~aL 


(2011)) can be used to generate Markov chains of 7 from its posterior distribution. In each 
iteration, with probability p{'j), we propose flipping a random parr of 0 and 1 in 7, and with 
probability 1 — ^( 7 ), we propose changing 7 ^ to 1 — 7 ^ for a random j while leaving other 
coordinates untouched. We set ^(7) = 0 when 7 corresponds to the null model or the full model, 
and p( 7 ) = .5 otherwise. Finally, the proposal is accepted with an appropriate probability. Since 
this Markov chain lies on a finite state space, it is uniformly ergodic and hence polynomially 
ergodic as well. Further, moment conditions in Theorems 2 and 3 are satisfied because they 
reduce to summations of 2'^ terms, a large but finite number. To achieve the goal, we generate 


Markov chains of 7 with respect to model ( |F.2| ) at several h values that scatters in "H, from which 
we build generalized IS estimators, and estimate their standard errors. 


F.l Cookie dough data 


We demonstrate the aforementioned sensitivity analysis using the biscuit dough dataset (Brown 


et al. (2001); Osborne et al. (1984)). The dataset, available in the R package ppls (Kraemer and 


Boulesteix] ( |2012| )), contains a training set of 39 observations and a test set of 31 observations. 
These data were obtained from a near-infrared spectroscopy experiment that study the composi¬ 
tion of biscuit dough pieces. For each biscuit, the reflectance spectrum is measured at 700 evenly 
spaced wavelengths. We use these measurements as covariates to predict the response variable. 


the percentage of water in each dough. We follow previous studies (Hans (2011)) and thin the 
spectral to g = 50 evenly spaced wavelengths. 

Figure [^provides a general picture of the sensitivity analysis. The left plots provide two ways 
to visualize estimates of the BFs. To form the plot, we took the 12 reference values ofh= {w, A) 
to be such that (te, — log(A)) G {0.1, 0.2, 0.3, 0.4} x {1,3,5}. In stage 1 we ran each of the 
12 Markov chains at the above values of h for 10^ iterations to obtain d. In stage 2, we ran the 
same 12 Markov chains for 50,000 iterations each, to form the estimates iin over a fine grid that 
consists of 475 different h values, with the w component ranging from 0.05 to 0.5 in increments 
of 0.025 and the — log(A) component ranging from 0 to 6 in increments of 0.25. 


42 












































Figure 8: Left panels provide a surface plot and a contour plot for BF estimates. The upper-right 
panel displays standard errors with respect to the BF estimates. The lower-right panel shows 
pmse over the test set. 


How trustworthy are these BF estimates? Their estimated standard errors are obtained using 
the BM method, based on Theorem 2. We choose to display the relative SE with respect to the 
BF estimates, as shown in the upper right panel of Figure The relative SEs are smaller than or 
equal to 5%, and we believe the BF estimates are accurate enough. Finally, the lower-right panel 
of Figure]^ shows the prediction mean squared error (pmse) over the test set for all h. 

Based on our estimation, the BF attains the maximum value 9.75 at /irb = (0.075, e“®). 
Recall when comparing any two models indexed by h and h' respectively, the BF between them 
is given by BFh^h> = • Also, according to 


Jeffreys 


1998)and 


Kass and Raftery (19951, 
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EB method, (w, - log (X )) = (0.075, 5) 


Standard method, (w, - log (X)) = (0.5, 0) 



wavelength 



Figure 9: Estimated posterior mean of regression coefficients (with 95% point-wise confidence 
intervals) for the empirical Bayesian and the standard choice of h, respectively. 

the evidence for h over h' is considered to be strong only if BF/j is greater than 10 or 20. Hence, 
all h with BE over 1/10 or 1/20 times the maximum BE can be considered as reasonably well 
supported by the data as that of the empirical Bayesian choice. Comparing the lower two plots 
of Figure we see that the set Ac := {h G T-L : BE > c} for c = 1 and 0.5 do overlap with an 
area that corresponds to relatively small pmse. Outside v4o.5> a region that consists of larger w 
and smaller — log(A) also enjoys small pmse values, at around 0.3 to 0.4. This region includes 
the common choice of ho = (0.5, e°). These suggest that /ieb and its vicinity might not be the 
only area of h that has good prediction performances. 

To better compare the effect of /z-eb = (0.075, e“^) and the commonly used h = (0.5, e°). 
Figure displays the estimated posterior mean of regression coefficients at both choices of h, 
together with the point-wise 95% confidence intervals for the posterior means. Due to the small 
size of Web and Aeb, the empirical Bayesian method yields a model with a few covariates that 
have big coefficients. In comparison, the common choice has larger w and A values, leading to 
a regression model that combines more covariates each having smaller effects. It turns out these 
two opposite strategy of modeling both predict the test dataset well, with pmse being 0.411 and 
0.431 respectively. For comparison, pmses were calculated for several frequentist penalized linear 
regression methods with their respective penalty parameters chosen by 10-fold cross validation. 
The resulting pmses for the ridge, the lasso and the elastic net method are 4.675, 0.633 and 0.536, 
respectively. 

The BM method for estimating SE is carried out above without the need of further user 
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input. Theoretically, its competitor RS can be developed too, if enough regeneration times can 
be identified for each Markov chain. Recall that with the random-swap algorithm, each Markov 
chain lives on the discrete state space T of size 2^. A naive way to introduce regeneration is to 
specify a single point 71 , then each visit of the Markov chain to 71 marks a regeneration time. 
Note that the chance of visiting 71 converges to 7 r( 7 i), the posterior probability of 71 . In our 
BVS model with ^ 1.1 X 10^^ states of 7 , even max.ygr 7 r( 7 ) could be very small. Take for 
example the Markov chain for the BVS model with h = {w, log( 5 f)) = (.4,1), the point with 
the highest frequency appeared only 8 out of a run of 10^ iterations. And that the waiting times 
between consecutive regenerations are highly variable, which ranges from less than ten iterations 
to a few thousand iterations. To obtain alternative schemes of identifying regeneration times, one 
can take the general minorization condition approach. It could potentially increase the chance of 
regeneration and reduce variability of the waiting times. Specifically, for any a G {1, 2, ■ ■ ■ , 2'^}, 
one could define Da to contain the a points with the highest posterior probabilities, and find 
Cq, e ( 0 , 1 ] and a probability mass function kai-) such thatp( 7 '| 7 ) > ealDa{l)ka{l') for all 
7 ' G r. Note that as a increases, the chance of visiting Da improves, but the conditional rate 
of regeneration given the current state 7 is in Da, would decrease sharply. Finding a good a to 
maximize the overall chance of regeneration requires tuning that is specific for both the model 
specification h and the dataset. Even if we can find the optimal a for each Markov chain used in 
the example, it is unlikely that all of them would regenerate often enough for the RS estimator to 
be stable. 
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